Comparative study of oral and written French automatically tagged with morpho-syntactic information

Author

  • Véronique Gendner
Abstract

In this paper, we investigate the automatic tagging of French corpora and compare the morpho-syntactic properties of spoken and written language across corpora from different sources. Morpho-syntactic properties are first described according to the distribution of the 8 main POS in five corpora of about 1 million words each. The automatic tagging used about a hundred tags; we describe the distinctions they allow and the reasons they were chosen. We further discuss variation in the common/proper noun distinction and in some distinctions made within the verb category. For this comparison, corpora of about 40 million words were used. These larger corpora were also used to study the influence of corpus size on vocabularies. Our study of French shows that sources in the news domain contain about 36% noun-like items (nouns and pronouns). This correlates strongly with Hudson’s earlier studies of the English Brown and LOB corpora. A task-specific dialog corpus shows the highest proportion of noun-like items, 43%. Spoken news shows about 5% fewer nouns and 5% more pronouns than written news.
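As a rough illustration of the kind of measurement the abstract describes, the sketch below computes a POS distribution and the share of noun-like items (nouns plus pronouns) from a list of tagged tokens. It is a minimal sketch, not the author's method: the coarse tag names (`NOUN`, `PRON`, and so on), the function names, and the toy sample are all assumptions made for illustration, since the paper's actual tagset of about a hundred tags is not given here.

```python
from collections import Counter

# Hypothetical coarse tag names; the paper's real tagset is not specified here.
NOUN_LIKE = {"NOUN", "PRON"}


def pos_distribution(tagged_tokens):
    """Return each POS tag's share of all tokens as a fraction in [0, 1]."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def noun_like_share(tagged_tokens):
    """Return the combined share of nouns and pronouns."""
    dist = pos_distribution(tagged_tokens)
    return sum(share for tag, share in dist.items() if tag in NOUN_LIKE)


# Toy sample invented for illustration only; not data from the paper.
sample = [
    ("le", "DET"),
    ("chat", "NOUN"),
    ("dort", "VERB"),
    ("il", "PRON"),
    ("mange", "VERB"),
]

if __name__ == "__main__":
    print(pos_distribution(sample))
    print(noun_like_share(sample))
```

Running the same two functions over a spoken and a written corpus would give directly comparable proportions, which is the form in which the abstract reports its figures.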

Similar resources

Knowledge integration in a robust and efficient morpho-syntactic analyser for French

Lorne H. Bouchard, Dép. de mathématiques et d'informatique, Université du Québec à Montréal, C.P. 8888, Succursale "A", Montréal, QC, Canada H3C 3P8, R15320@UQAM.BITNET. We present a morpho-syntactic analyzer for French which is capable of automatically detecting and correcting (automatically or with user help) spelling mistakes, agreement errors and certain frequently encountered syntactic erro...

Using Verb-Noun Patterns to Detect Process Inputs

We present the preliminary results of an ongoing work aimed at using morpho-syntactic patterns to extract information from process descriptions in a semi-supervised manner. The experiments have been designed for generic information extraction tasks and evaluated on detecting ingredients from cooking recipes in French using a large gold standard corpus. The proposed method uses bi-lexical depend...

Using PiTagger for Lemmatization and PoS Tagging of a Spontaneous Speech Corpus: C-Oral-Rom Italian

The automatic lemmatization and morpho-syntactic annotation of spoken language is a fairly recent and complex task for Natural Language Processing. The state of the art on written corpora does not provide a satisfactory level of analysis for spontaneous spoken language (Uchimoto et al., 2002; Moreno & Guirao, 2003). The spontaneous speech corpus Italian C-ORALROM has been tagged with ...

Comparison of the high-frequency morpho-syntactic structures of cochlear implant children and children with normal hearing aged 4-6 years

Introduction: Children with cochlear implants experience problems in all language domains, and have more problems with morpho-syntactic skills than in other domains. Considering the importance of morphology and syntax in the development of children's communication skills, this study compared the use of high-frequency morpho-syntactic structures among 4-6-year-old children with cochlear implants and ty...

A Study on Morpho-Syntactic Patterns: A Cohesive Device in Some Persian Live Sport Radio and TV Talks

The morpho-syntactic patterns device is a subcategory of cohesive devices that helps hearers form an adequate mental representation for understanding speech. This article investigates the morpho-syntactic patterns employed in some Persian live sport radio and TV programs, adapting Dooley and Levinsohn’s theoretical and analytical framework. The research data includes around 30,000 ...

Publication date: 2002